Deep Multimodal Learning for Emotion Recognition in Spoken Language
Authors
Abstract
In this paper, we present a novel deep multimodal framework to predict human emotions from sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure, which considers the spatial information in the text, the temporal information in the audio, and high-level associations derived from low-level handcrafted features. Second, we fuse all features with a three-layer deep neural network to learn the correlations across modalities, and we train the feature extraction and fusion modules together, allowing optimal global fine-tuning of the entire structure. We evaluated the proposed framework on the IEMOCAP dataset. Our results show promising performance, achieving 60.4% weighted accuracy over five emotion categories.
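As a rough illustration of the fusion scheme described in the abstract, the sketch below builds a two-branch network with a joint three-layer fusion head. It is a minimal sketch under assumptions not stated in the abstract: a 1D CNN over word embeddings for the text branch, an LSTM over frame-level acoustic features (e.g., MFCCs) for the audio branch, and a small projection of hand-crafted low-level descriptors; all layer sizes and feature dimensions are illustrative, not the authors' configuration.

```python
# Hedged sketch in PyTorch; dimensions and feature choices are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, audio_dim=40,
                 handcrafted_dim=384, num_classes=5):
        super().__init__()
        # Text branch: a 1D convolution captures local ("spatial") patterns over the word sequence.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.text_conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Audio branch: an LSTM captures the temporal dynamics of acoustic frames.
        self.audio_lstm = nn.LSTM(audio_dim, 128, batch_first=True)
        # Hand-crafted low-level descriptors are lifted to a higher-level representation.
        self.handcrafted_fc = nn.Sequential(nn.Linear(handcrafted_dim, 128), nn.ReLU())
        # Three-layer fusion network learns cross-modal correlations.
        self.fusion = nn.Sequential(
            nn.Linear(128 * 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, tokens, audio_frames, handcrafted):
        t = self.embed(tokens).transpose(1, 2)           # (B, emb_dim, seq_len)
        t = self.text_conv(t).squeeze(-1)                # (B, 128)
        _, (h, _) = self.audio_lstm(audio_frames)        # h: (1, B, 128)
        a = h[-1]                                        # (B, 128)
        c = self.handcrafted_fc(handcrafted)             # (B, 128)
        return self.fusion(torch.cat([t, a, c], dim=1))  # (B, num_classes)
```

Because the branches and the fusion head sit in one module, a single optimizer step updates feature extraction and fusion jointly, which is the "global fine-tuning of the entire structure" the abstract refers to.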
Related references
The multimodal nature of spoken word processing in the visual world: Testing the predictions of alternative models of multimodal integration
Ambiguity in natural language is ubiquitous (Piantadosi, Tily & Gibson, 2012), yet spoken communication is effective due to the integration of information carried in the speech signal with information available in the surrounding multimodal landscape. However, current cognitive models of spoken word recognition and comprehension are underspecified with respect to when and how multimodal information...
Emotion Recognition Using Multimodal Deep Learning
To enhance the performance of affective models and reduce the cost of acquiring physiological signals for real-world applications, we adopt a multimodal deep learning approach to construct affective models with the SEED and DEAP datasets to recognize different kinds of emotions. We demonstrate that high-level representation features extracted by the Bimodal Deep AutoEncoder (BDAE) are effective for e...
Music Emotion Recognition via End-to-End Multimodal Neural Networks
Music emotion recognition (MER) is a key issue in user context-aware recommendation. Many existing methods require hand-crafted features on audio and lyrics. Here we propose a new end-to-end method for recognizing the emotions of tracks from their acoustic signals and lyrics via multimodal deep neural networks. We evaluate our method on about 7,000 K-pop tracks labeled as positive or negative emotion...
Multimodal Emotion Recognition Using Multimodal Deep Learning
To enhance the performance of affective models and reduce the cost of acquiring physiological signals for real-world applications, we adopt a multimodal deep learning approach to construct affective models from multiple physiological signals. For the unimodal enhancement task, we show that the best recognition accuracy of 82.11% on the SEED dataset is achieved with shared representations generated by...
Modeling affected user behavior during human-machine interaction
Spoken human-machine interaction supported by state-of-the-art dialog systems is becoming a standard technology, and a great deal of effort has been invested in this kind of artificial communication interface. Still, spoken dialog systems (SDS) are not yet able to offer users a natural way of communicating. Most existing automated dialog systems are based on a questionnaire-based st...
Journal: CoRR
Volume: abs/1802.08332
Pages: -
Publication year: 2018